This dataset explores the different chemical properties of red wine that affect its quality.
As shown below, the dataset has 1599 rows and 12 columns. The output also confirms that the dataset is relatively tidy: every column is numeric and each row describes a single wine.
## [1] 1599
## [1] 12
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1594           6.8            0.620        0.08            1.9     0.068
## 1595           6.2            0.600        0.08            2.0     0.090
## 1596           5.9            0.550        0.10            2.2     0.062
## 1597           6.3            0.510        0.13            2.3     0.076
## 1598           5.9            0.645        0.12            2.0     0.075
## 1599           6.0            0.310        0.47            3.6     0.067
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1594                  28                   38 0.99651 3.42      0.82
## 1595                  32                   44 0.99490 3.45      0.58
## 1596                  39                   51 0.99512 3.52      0.76
## 1597                  29                   40 0.99574 3.42      0.75
## 1598                  32                   44 0.99547 3.57      0.71
## 1599                  18                   42 0.99549 3.39      0.66
##      alcohol quality
## 1594     9.5       6
## 1595    10.5       5
## 1596    11.2       6
## 1597    11.0       6
## 1598    10.2       5
## 1599    11.0       6
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
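The inspection steps above can be sketched as follows. This is a minimal, self-contained illustration that rebuilds only the six rows printed by `head()`; the data frame name `pf` is taken from the model calls later in this report, while the file path in the comment is a hypothetical placeholder because the source does not name the file.

```r
# In the real analysis the data would be read from disk, for example:
# pf <- read.csv("path/to/red-wine-file.csv")  # hypothetical path, not given in the report

# Here we rebuild only the first six rows shown in the head() output above.
pf <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity     = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density              = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Dimensions and structure, as printed above for the full dataset.
nrow(pf)
ncol(pf)
str(pf)

# Two quick tidiness checks: missing values and exact duplicate rows.
n_missing    <- sum(is.na(pf))
n_duplicated <- sum(duplicated(pf))
```

On this six-row excerpt, rows 1 and 5 are identical, so a duplicate check may be worth running on the full dataset as well.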
Quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The quality scores appear to be roughly normally distributed. Note that quality scores are discrete values between 3 and 8.
Fixed Acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed Acidity values appear to be slightly skewed to the right. Using a log scale normalises this data.
Volatile Acidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile Acidity values appear to be slightly skewed to the right. Using a log scale normalises this data.
Citric Acid
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric Acid values appear to be skewed to the right. Using a log scale normalises this data. However, a log scale is problematic here, as the data lies between 0 and 1 and includes zeros, for which the logarithm is undefined.
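The problem with zeros on a log scale can be seen with a small sketch. The values below are the six citric acid readings from the `head()` output; `log1p()` is shown as one common workaround and is an assumption on my part, not necessarily what the original plots used.

```r
# Citric acid values from the first six rows of the dataset.
citric <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)

# A plain log transform maps every zero to -Inf, so those points drop off the plot.
log_plain <- log10(citric)
n_lost <- sum(is.infinite(log_plain))

# log1p(x) computes log(1 + x), which is finite at zero.
log_shifted <- log1p(citric)
all_finite <- all(is.finite(log_shifted))
```

Four of the six values here are zero, so a plain log scale would silently discard most of this excerpt.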
Residual Sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual Sugar values appear to be skewed to the right. Using a log scale does not normalise the data. This suggests that there are some really extreme values for residual sugar.
Chlorides
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chloride values appear to be skewed to the right. Using a log scale normalises the data.
Free Sulfur Dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free Sulfur Dioxide values appear to be skewed to the right. Using a log scale somewhat normalises this data. On the log scale the distribution looks like a mixture of two normal distributions.
Total Sulfur Dioxide
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total Sulfur Dioxide values appear to be skewed to the right. Using a log scale somewhat normalises this data.
Density
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
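Density values appear to be normally distributed. Note that the summary printed above is identical to the Total Sulfur Dioxide summary, so it does not describe density.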
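For reference, the density values visible in the `head()` and `tail()` output earlier all sit just below 1. A minimal sketch summarising only those twelve values (not the full column) is given here; on the full data the equivalent call would presumably be `summary(pf$density)`.

```r
# Density values copied from the head() and tail() output (rows 1-6 and 1594-1599).
density_sample <- c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978,
                    0.99651, 0.99490, 0.99512, 0.99574, 0.99547, 0.99549)

# Six-number summary of this small excerpt only.
summary(density_sample)

# Every value in the excerpt lies between 0.99 and 1.
in_expected_range <- all(density_sample > 0.99 & density_sample < 1)
```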
pH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH values appear to be normally distributed, and it seems no transformation is required.
Sulphates
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates values appear to be skewed to the right. Using a log scale normalises this data.
Alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Alcohol values appear to be skewed to the right. Using a log scale does not normalise the data.
There are 1599 red wines in the dataset with 12 variables (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality). All the variables in this dataset are numerical. Quality is the one special case: it is an integer score that can also be treated as an ordered factor.
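Treating quality as an ordered factor could look like the sketch below. The values are the first ten quality scores from the `str()` output, the levels 3 to 8 follow the range reported in the summary, and the name `quality.f` is hypothetical.

```r
# First ten quality scores as shown in the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor covering the full observed range of scores (3 to 8),
# so that levels with no wines in this excerpt still appear in tables.
quality.f <- factor(quality, levels = 3:8, ordered = TRUE)

# Counts per score; unused levels are reported as 0.
counts <- table(quality.f)
counts
```

Keeping the integer column alongside the factor is convenient: the integer version suits correlations and regression, while the factor suits box plots and facets.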
Other observations:
The median quality score is 6.
Quality scores only range between 3 and 8.
Many of the variables are measurements that must be greater than 0. This, combined with the fact that the dataset is relatively small, has caused many of the distributions to be skewed to the right. Therefore there isn't enough justification to use a log scale.
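A rough way to check the right skew noted above is to compare each mean with its median, since a long right tail pulls the mean upwards. The sketch below uses only the medians and means printed in the summaries above; density is left out because its printed summary repeats the Total Sulfur Dioxide figures. This is a heuristic, not a formal skewness test.

```r
# Medians and means copied from the summary() output for each variable.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "pH", "sulphates", "alcohol"),
  median   = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00, 38.00, 3.310, 0.6200, 10.20),
  mean     = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47, 3.311, 0.6581, 10.42)
)

# Relative gap between mean and median; larger values hint at a longer right tail.
stats$relative_gap <- (stats$mean - stats$median) / stats$median

# Sort so the most skewed-looking variables come first.
stats <- stats[order(-stats$relative_gap), ]
stats
```

By this measure total sulfur dioxide and residual sugar show the largest gaps, while pH shows almost none, which agrees with the descriptions above.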
The main feature of this dataset is the quality scores of the wines. I would like to determine which combination of chemical properties can be attributed towards these quality scores.
I believe the main features affecting quality should be the acidity level, the density and the alcohol content. Other factors may also have an influence. For example, the levels of fixed acidity and volatile acidity may affect the pH and therefore have an impact. Similarly, residual sugar may have an impact on alcohol content, which in turn may affect the quality. Since the quality score is an arbitrary rating given by a judge, it is hard to say which features are the most important at this stage of the analysis.
No. I modify variables later once I am able to determine the important variables affecting the quality score (after completing bivariate analysis).
No. I perform changes later once I am able to determine the important variables affecting the quality score (after completing bivariate analysis).
Bivariate Correlation Analysis
The analysis above was done to find any significant bivariate relationships. It tries to capture linear relationships using a correlation coefficient. Coefficients close to 1 indicate a strong direct relationship between factors, coefficients close to -1 indicate a strong inverse relationship, and coefficients close to 0 indicate little or no linear relationship.
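As an illustration of what the correlation coefficient measures, the sketch below computes Pearson's r two ways on the first ten alcohol and quality values from the `str()` output. With only ten rows the result will not match the full-data coefficients reported in this section; the point is the calculation, not the number.

```r
# First ten values of each variable, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Built-in Pearson correlation.
r_builtin <- cor(alcohol, quality, method = "pearson")

# The same quantity from its definition:
# covariance of the centred values divided by the product of their spreads.
dx <- alcohol - mean(alcohol)
dy <- quality - mean(quality)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
```

On a full data frame, `cor()` applied to all numeric columns returns the whole correlation matrix at once.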
It seems that the quality score is affected by the following factors:
Volatile Acidity (Correlation -0.391)
Citric acid (Correlation 0.226)
Sulphates (Correlation 0.251)
Alcohol (Correlation 0.476)
Of these I think alcohol and volatile acidity are the most significant.
The following other relationships were also observed:
Volatile acidity and fixed acidity (Correlation -0.256)
Fixed acidity and citric acid (Correlation 0.672)
Volatile acidity and citric acid (Correlation -0.552)
Density and fixed acidity (Correlation 0.668)
Density and citric acid (Correlation 0.365)
pH and fixed acidity (Correlation -0.683)
pH and volatile acidity (Correlation 0.235)
pH and citric acid (Correlation -0.542)
pH and density (Correlation -0.342)
sulphates and volatile acidity (Correlation -0.261)
sulphates and citric acid (Correlation 0.313)
alcohol and density (Correlation -0.496)
alcohol and pH (Correlation 0.206)
Of these I think the following relationships are the most important:
Quality vs Volatile Acidity
As shown above there appears to be a negative relationship between quality and volatile acidity. However it does not seem that this relationship is linear.
Quality vs Citric Acid
It seems that the relationship between citric acid and quality is almost non-existent. The downward trend at the end is caused by a single extreme value.
Quality vs Sulphates
There does not seem to be any relationship between quality and sulphates.
Quality vs Alcohol
Alcohol seems to improve quality up to a certain point (13% alcohol). However as shown this relationship is not linear.
Fixed acidity vs Density
There is some evidence of a linear relationship between density and fixed acidity.
Fixed acidity vs pH
There is some evidence of an inverse linear relationship between pH and fixed acidity.
Fixed acidity vs Citric Acid
There is some evidence of a linear relationship between citric acid and fixed acidity. The downtrend at the end is misleading as it is the result of only one extreme value.
In summary, the quality score is most strongly associated with the factors listed above: volatile acidity (Correlation -0.391), citric acid (Correlation 0.226), sulphates (Correlation 0.251) and alcohol (Correlation 0.476). Of these I think alcohol and volatile acidity are the most significant.
Among the other relationships listed above, I think the most important is the negative relationship between pH and fixed acidity (Correlation -0.683).
Quality, volatile acidity and alcohol
Since alcohol and volatile acidity were the most significant factors affecting quality score, I decided to analyse this relationship further. In this case I decided to use the rainbow ROYGBIV (Red, Orange, Yellow, Green, Blue, Indigo, Violet) colour scale for alcohol. This is because the optimum quality score does not seem to occur at extremely high or low values of alcohol but rather at values in between. Using a ROYGBIV scale allows us to discern these levels more easily than if we used a single sequential colour scale.
As shown in the plot and model above, there is definitely some evidence of a relationship between alcohol, volatile acidity and quality. We can explore this relationship further by discretising alcohol and volatile acidity.
Quality, volatile acidity and alcohol - A discretised Analysis
The graph above discretises alcohol and volatile acidity so that we can better see what is happening. However, as shown, it can be hard to discern the colours when they overlap each other due to the jitter. One way to overcome this problem is to use facets to separate the colours and see which of them occurs most often in a given region. The following plot shows how we can do this.
As shown in the plot above, there is some evidence that both alcohol and volatile acidity affect quality scores.
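The discretisation code is not shown in this report, so the following is only a sketch of one way it may have been done. It assumes rounding alcohol to whole numbers and volatile acidity to one decimal place, which would be consistent with the variable names `alcohol.level` and `volatile.acidity.level` used in the regression output later on and with the reported ranges (alcohol 8.4 to 14.9, volatile acidity 0.12 to 1.58).

```r
# First ten values of each variable, copied from the str() output.
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# ASSUMPTION: levels are obtained by rounding; the original code may differ.
# Ordered factors keep the natural low-to-high ordering of the bins.
alcohol.level          <- factor(round(alcohol), ordered = TRUE)
volatile.acidity.level <- factor(round(volatile.acidity, 1), ordered = TRUE)

# Cross-tabulation: how many wines fall in each combination of bins.
bin_counts <- table(alcohol.level, volatile.acidity.level)
bin_counts
```

Factors built this way can be used directly as facet or colour variables in a plot, which is the separation described above.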
Quality, volatile acidity and alcohol - Regression Models
Given the relationship shown above, I decided to model quality as a function of volatile acidity and alcohol. I tested a number of models, including some with additional variables and some using logarithmic transformations. However, I found that most models had r squared values similar to the simple model shown below. I decided to choose this simple model, as simpler models are easier to interpret and tend to be less prone to overfitting than complex models.
##
## Call:
## lm(formula = I(quality) ~ I(alcohol) + I(volatile.acidity), data = pf)
##
## Coefficients:
## (Intercept) I(alcohol) I(volatile.acidity)
## 3.0955 0.3138 -1.3836
##
## Call:
## lm(formula = I(quality) ~ I(alcohol) + I(volatile.acidity), data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59342 -0.40416 -0.07426 0.46539 2.25809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.09547 0.18450 16.78 <2e-16 ***
## I(alcohol) 0.31381 0.01601 19.60 <2e-16 ***
## I(volatile.acidity) -1.38364 0.09527 -14.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6678 on 1596 degrees of freedom
## Multiple R-squared: 0.317, Adjusted R-squared: 0.3161
## F-statistic: 370.4 on 2 and 1596 DF, p-value: < 2.2e-16
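To make the fitted equation concrete, the sketch below plugs two wines from the `head()` output into the coefficients reported above. The helper name `predict_quality` is hypothetical; with the fitted model object available, R's `predict()` would normally be used instead.

```r
# Coefficients copied from the summary output above.
b_intercept <- 3.09547
b_alcohol   <- 0.31381
b_volatile  <- -1.38364

# Hypothetical helper: predicted quality from the linear model.
predict_quality <- function(alcohol, volatile.acidity) {
  b_intercept + b_alcohol * alcohol + b_volatile * volatile.acidity
}

# Wine 1: alcohol 9.4, volatile acidity 0.70, observed quality 5.
pred_wine1 <- predict_quality(9.4, 0.70)   # about 5.08

# Wine 4: alcohol 9.8, volatile acidity 0.28, observed quality 6.
pred_wine4 <- predict_quality(9.8, 0.28)   # about 5.78
```

Both predictions land within the residual standard error (0.6678) of the observed scores, but the model cannot produce the integer scores themselves.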
Based on previous analysis I decided to test a discretised version of this model. The results are shown below.
##
## Call:
## lm(formula = I(quality) ~ I(alcohol.level) + I(volatile.acidity.level),
## data = pf)
##
## Coefficients:
## (Intercept) I(alcohol.level).L
## 5.17117 0.90354
## I(alcohol.level).Q I(alcohol.level).C
## -1.39222 -0.68555
## I(alcohol.level)^4 I(alcohol.level)^5
## -0.59854 -0.12606
## I(alcohol.level)^6 I(alcohol.level)^7
## -0.22484 -0.05995
## I(volatile.acidity.level).L I(volatile.acidity.level).Q
## -2.98340 -0.63900
## I(volatile.acidity.level).C I(volatile.acidity.level)^4
## -0.56883 -0.31472
## I(volatile.acidity.level)^5 I(volatile.acidity.level)^6
## -0.59662 -0.58337
## I(volatile.acidity.level)^7 I(volatile.acidity.level)^8
## -0.64377 -0.19910
## I(volatile.acidity.level)^9 I(volatile.acidity.level)^10
## -0.15420 0.09751
## I(volatile.acidity.level)^11 I(volatile.acidity.level)^12
## 0.06453 0.15364
## I(volatile.acidity.level)^13
## 0.11735
##
## Call:
## lm(formula = I(quality) ~ I(alcohol.level) + I(volatile.acidity.level),
## data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6134 -0.3603 -0.1082 0.4384 2.1345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.17117 0.12493 41.393 < 2e-16 ***
## I(alcohol.level).L 0.90354 0.42219 2.140 0.032498 *
## I(alcohol.level).Q -1.39222 0.41669 -3.341 0.000854 ***
## I(alcohol.level).C -0.68555 0.33872 -2.024 0.043143 *
## I(alcohol.level)^4 -0.59854 0.23951 -2.499 0.012553 *
## I(alcohol.level)^5 -0.12606 0.15286 -0.825 0.409666
## I(alcohol.level)^6 -0.22484 0.09275 -2.424 0.015461 *
## I(alcohol.level)^7 -0.05995 0.05513 -1.087 0.276984
## I(volatile.acidity.level).L -2.98340 0.40220 -7.418 1.94e-13 ***
## I(volatile.acidity.level).Q -0.63900 0.39641 -1.612 0.107170
## I(volatile.acidity.level).C -0.56883 0.38064 -1.494 0.135268
## I(volatile.acidity.level)^4 -0.31472 0.35490 -0.887 0.375323
## I(volatile.acidity.level)^5 -0.59662 0.31584 -1.889 0.059074 .
## I(volatile.acidity.level)^6 -0.58337 0.29118 -2.003 0.045299 *
## I(volatile.acidity.level)^7 -0.64377 0.26817 -2.401 0.016482 *
## I(volatile.acidity.level)^8 -0.19910 0.23728 -0.839 0.401550
## I(volatile.acidity.level)^9 -0.15420 0.22305 -0.691 0.489448
## I(volatile.acidity.level)^10 0.09751 0.20844 0.468 0.639998
## I(volatile.acidity.level)^11 0.06453 0.16828 0.383 0.701436
## I(volatile.acidity.level)^12 0.15364 0.11844 1.297 0.194755
## I(volatile.acidity.level)^13 0.11735 0.07833 1.498 0.134270
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6651 on 1578 degrees of freedom
## Multiple R-squared: 0.3302, Adjusted R-squared: 0.3217
## F-statistic: 38.9 on 20 and 1578 DF, p-value: < 2.2e-16
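As shown, this model produces a slightly better r squared (0.3302 versus 0.317). Part of this gain is expected simply because the model has 20 terms instead of 2; the adjusted r squared, which penalises the extra terms, rises only from 0.3161 to 0.3217. Intuitively a small improvement makes sense, as there are likely to be small errors in data measurement, and using rounded values may help to smooth these out.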
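When comparing models with different numbers of terms, the adjusted r squared is usually the fairer yardstick. The sketch below recomputes it from the figures printed in the two summaries, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1); the helper name `adjusted_r2` is hypothetical.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n and number of predictors p (excluding the intercept).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Continuous model: R-squared 0.317 with 2 predictors.
adj_continuous <- adjusted_r2(0.317, n = 1599, p = 2)

# Discretised model: R-squared 0.3302 with 20 polynomial contrast terms.
adj_discretised <- adjusted_r2(0.3302, n = 1599, p = 20)

# Gain in adjusted R-squared from discretising.
gain <- adj_discretised - adj_continuous
```

The results agree with the printed values (0.3161 and 0.3217) up to rounding, so the discretised model gains well under one percentage point once its extra terms are accounted for.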
Fixed Acidity, Density, pH & Citric Acid
My previous analysis indicates that there is a relationship between fixed acidity, density, pH and citric acid. I decided to explore this relationship further. As it is difficult to explore the relationship between 4 variables, I decided to look at 3 variables at a time. First I explored the relationship between fixed acidity, density and citric acid. The results are shown in the plot below.
As shown fixed acidity increases when either density or citric acid values increase.
Similarly, a relationship between fixed acidity, density and pH can be seen in the plot below.
As shown, fixed acidity increases when density increases but decreases when pH increases.
Having seen that all these 4 variables are related, I decided to combine them in one visualisation so that I could better understand their relationship. In this case I was able to achieve this by discretising values of citric acid.
As shown in the graph above there is a clear relationship between citric acid, density, fixed acidity and pH levels. This is further confirmed in the regression model below.
##
## Call:
## lm(formula = I(fixed.acidity) ~ I(citric.acid) + I(density) +
## I(pH), data = pf)
##
## Coefficients:
## (Intercept) I(citric.acid) I(density) I(pH)
## -371.809 2.844 394.252 -4.111
##
## Call:
## lm(formula = I(fixed.acidity) ~ I(citric.acid) + I(density) +
## I(pH), data = pf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6657 -0.5179 0.0037 0.5350 4.8048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -371.8088 12.7339 -29.20 <2e-16 ***
## I(citric.acid) 2.8441 0.1372 20.73 <2e-16 ***
## I(density) 394.2516 12.6660 31.13 <2e-16 ***
## I(pH) -4.1108 0.1715 -23.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8745 on 1595 degrees of freedom
## Multiple R-squared: 0.7482, Adjusted R-squared: 0.7477
## F-statistic: 1580 on 3 and 1595 DF, p-value: < 2.2e-16
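The size of the density coefficient can look alarming, so a small worked example may help. The sketch below applies the coefficients reported above to two wines from the `head()` output; the helper name `predict_fixed_acidity` is hypothetical.

```r
# Coefficients copied from the summary output above.
b_intercept <- -371.8088
b_citric    <- 2.8441
b_density   <- 394.2516
b_pH        <- -4.1108

# Hypothetical helper: predicted fixed acidity from the linear model.
predict_fixed_acidity <- function(citric.acid, density, pH) {
  b_intercept + b_citric * citric.acid + b_density * density + b_pH * pH
}

# Wine 1: citric acid 0.00, density 0.9978, pH 3.51, observed fixed acidity 7.4.
pred_wine1 <- predict_fixed_acidity(0.00, 0.9978, 3.51)   # about 7.15

# Wine 4: citric acid 0.56, density 0.9980, pH 3.16, observed fixed acidity 11.2.
pred_wine4 <- predict_fixed_acidity(0.56, 0.9980, 3.16)   # about 10.26

# Density only varies in the third decimal place, so the practical effect
# of a 0.001 change in density is a shift of about 0.39 in fixed acidity.
effect_per_0.001_density <- b_density * 0.001
```

The large intercept and density coefficient largely offset each other because density is always close to 1.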
As shown in the visualisation above, there may be an ideal range of alcohol and volatile acidity levels that maximises quality. However, this combination is not a simple linear relationship. It seems that the optimal alcohol level is between 13 and 14 and the optimal volatile acidity level is between 0.3 and 0.8. The low r squared value suggests that other elements are missing from this investigation.
As shown in the graph above there is a clear relationship between citric acid, density, fixed acidity and pH levels. From this relationship we can see that citric acid (or components that help make it) is an influential ingredient that affects density, fixed acidity and pH levels of red wine.
The model that I finally decided to use expresses quality as a function of discretised alcohol and volatile acidity levels. I chose this model because it is relatively simple and has a reasonable r squared value (over 0.3). While other factors did improve the r squared score, their impact was negligible. As explained above, this model is relatively limited. Much of the variation in quality is left unexplained, and it is likely that we are missing important information that could help us predict quality.
This plot summarises the relationship between quality scores, alcohol, and volatile acidity. I chose to use this plot as it clearly shows that there is an optimum range for alcohol and volatile acidity that maximises quality scores. As shown in the plot, this relationship is not purely linear and more information is required to predict the quality score more accurately.
This plot shows the relationship between fixed acidity, density, citric acid and pH. It is important as it clearly shows that fixed acidity increases when either density increases or citric acid increases but decreases as pH increases.
This plot tries to combine the information shown in the two plots in plot 2 so that we can see the relationship between fixed acidity, density, pH and citric acid levels. It does this by discretising citric acid levels so that we can use facets to see how pH and density affect fixed acidity. We can also see how fixed acidity increases as we move to higher discretised levels of citric acid.
In this exploration the main objective was to explore the relationship between quality scores and the chemical properties of red wine. Our initial analysis showed that we had a limited range of data, which caused many of the individual measurements to be skewed to the right. A bivariate analysis suggested that a relationship existed between quality scores, volatile acidity and alcohol. Further analysis allowed us to plot this relationship and fit a linear regression. It seems that a simple or logarithmic regression is not sufficient, and that further analysis using step functions is required. More data and time are required to verify this. There is some indication that a step-based relationship could be used, as there are ideal ranges of alcohol and volatile acidity where quality scores are maximised.
In addition, a relationship was found between fixed acidity, density, pH and citric acid. This relationship had a much better r squared value when regression was performed and was confirmed by several plots. This suggests that citric acid (or some components of it) has a significant impact on density, pH and fixed acidity.